ABSTRACT

Recent years have seen the development of methods for multiagent planning under uncertainty that scale to tens or even hundreds of agents. However, most of these methods either make restrictive assumptions on the problem domain, or provide approximate solutions without any guarantees on quality. Methods in the former category typically build on heuristic search using upper bounds on the value function. Unfortunately, no techniques exist to compute such upper bounds for problems with non-factored value functions. To allow for meaningful benchmarking through measurable quality guarantees on a very general class of problems, this paper introduces a family of influence-optimistic upper bounds for factored decentralized partially observable Markov decision processes (Dec-POMDPs) that do not have factored value functions. Intuitively, we derive bounds on very large multiagent planning problems by subdividing them into sub-problems, and at each of these sub-problems making optimistic assumptions with respect to the influence that will be exerted by the rest of the system. We numerically compare the different upper bounds and demonstrate how we can achieve a non-trivial guarantee that a heuristic solution for problems with hundreds of agents is close to optimal. Furthermore, we provide evidence that the upper bounds may improve the effectiveness of heuristic influence search, and discuss further potential applications to multiagent planning.

1. INTRODUCTION 

Planning for multiagent systems (MASs) under uncertainty is an important research problem in artificial intelligence. The decentralized partially observable Markov decision process (Dec-POMDP) is a general principled framework for addressing such problems. Many recent approaches to solving Dec-POMDPs propose to exploit locality of interaction [22], also referred to as value factorization [16]. However, without making very strong assumptions, such as transition and observation independence [3], there is no strict locality: in general the actions of any agent may affect the rewards received in a different part of the system, even if that agent and the origin of that reward are (spatially) far apart. For instance, in a traffic network the actions taken in one part of the network will eventually influence the rest of the network [26].


A number of approaches have been proposed to generate solutions for large MASs [40, 48, 27, 47, 9, 36]. However, these heuristic methods come without guarantees. In fact, since it has been shown that approximation (given some ε, finding a solution with value within ε of optimal) of Dec-POMDPs is NEXP-complete [31], it is unrealistic to expect to find general, scalable methods that have such guarantees. However, the lack of guarantees also makes it difficult to meaningfully interpret the results produced by heuristic methods. In this work, we mitigate this issue by proposing a novel set of techniques that can be used to provide upper bounds on the performance of large factored Dec-POMDPs.

More generally, the ability to compute upper bounds is important for numerous reasons: 1) As stated above, they are crucial for a meaningful interpretation of the quality of heuristic methods. 2) Such knowledge of performance gaps is crucial for researchers to direct their focus to promising areas. 3) Such knowledge is also crucial for understanding which problems seem simpler to approximate than others, which in turn may lead to improved theoretical understanding of different problems. 4) Knowledge about the performance gap of the leading heuristic methods can also accelerate their real-world deployment, e.g., when their performance gap is proven to be small over sampled domain instances, or when the selection of which heuristic method to deploy is facilitated by clarifying the trade-off of computation and closeness to optimality. 5) Upper bounds on achievable value without communication may guide decisions on investments in communication infrastructure. 6) Last, but not least, these upper bounds can directly be used in current and future heuristic search methods, as we will discuss in some more detail at the end of this paper.

Computing upper bounds on the achievable value of a planning problem typically involves relaxing the original problem by making some optimistic assumptions. For instance, in the case of Dec-POMDPs typical assumptions are that the agents can communicate or observe the true state of the system [11, 35, 32, 25]. By exploiting the fact that transition and observation independence leads to a value function that is additively factored into a number of small components (we say that the value function is ‘factored’, or that the setting exhibits ‘value factorization’), such techniques have been extended to compute upper bounds for so-called network-distributed POMDPs (ND-POMDPs) with many agents. This has greatly increased the size of the problems that can be solved [38, 19, 9]. Unfortunately, assuming both transition and observation independence (or, more generally, value factorization) narrows down the applicability of the model, and no techniques for computing upper bounds for more general factored Dec-POMDPs with many agents are currently known.

We address this problem by proposing a general technique for computing what we call influence-optimistic upper bounds. These are upper bounds on the achievable value in large-scale MASs formed by computing local influence-optimistic upper bounds on the value of sub-problems that consist of small subsets of agents and state factors. The key idea is that if we make optimistic assumptions about how the rest of the system will influence a sub-problem, we can decouple it and effectively compute a local upper bound on the achievable value. Finally, we show how these local bounds can be combined into a global upper bound. In this way, the major contribution of this paper is that it shows how we can compute factored upper bounds for models that do not admit factored value functions.

We empirically evaluate the utility of influence-optimistic upper bounds by investigating the quality guarantees they provide for heuristic methods, and by examining their application in a heuristic search method. The results show that the proposed bounds are tight enough to give meaningful quality guarantees for the heuristic solutions for factored Dec-POMDPs with hundreds of agents.¹ This is a major accomplishment since previous approaches that provide guarantees 1) have required very particular structure such as transition and observation independence or ‘transition-decoupledness’ combined with very specific interaction structures (transitions of an agent can be affected in a directed fashion and only by a small subset of other agents) [42], and 2) have not scaled beyond 50 agents. In contrast, this paper demonstrates quality bounds in settings of hundreds of agents that all influence each other via their actions.

This paper is organized as follows. First, Section 2 describes the required background by introducing the factored Dec-POMDP model. Next, Section 3 describes the sub-problems that form the basis of our decomposition scheme. Section 4 proposes local influence-optimistic upper bounds for such sub-problems together with the techniques to compute them. Subsequently, Section 5 discusses how these local upper bounds can be combined into a global upper bound for large problems with many agents. Section 6 empirically investigates the merits of the proposed bounds. Section 7 places our work in the context of related work in more detail, and Section 8 concludes.

1 In the paper, we use the word ‘tight’ for its (empirical) 
meaning of “close to optimal”, not for its (theoretical CS) 
meaning of “coinciding with the best possible bound”. 


Figure 1: The FireFightingGraph problem. 

2. BACKGROUND 

In this paper we focus on factored Dec-POMDPs [26], which are Dec-POMDPs where the transition and observation models can be represented compactly as a two-stage dynamic Bayesian network (2DBN) [2]:

Definition 1. A factored Dec-POMDP is a tuple $\mathcal{M} = \langle \mathcal{D}, \mathcal{A}, \mathcal{O}, \mathcal{X}, T, O, \mathcal{R}, b^0 \rangle$, where:

• $\mathcal{D} = \{1, \dots, |\mathcal{D}|\}$ is the set of agents.

• $\mathcal{A} = \bigotimes_{i \in \mathcal{D}} \mathcal{A}_i$ is the set of joint actions $a$.

• $\mathcal{O} = \bigotimes_{i \in \mathcal{D}} \mathcal{O}_i$ is the set of joint observations $o$.

• $\mathcal{X} = \{X^1, \dots, X^m\}$ is a set of state variables, or factors, that take values $\mathrm{Dom}(X^k)$ and thus span the set of states $\mathcal{S} = \bigotimes_{X^k \in \mathcal{X}} \mathrm{Dom}(X^k)$.

• $T(s' \mid s, a)$ is the transition model, which is specified by a set of conditional probability tables (CPTs), one for each factor.

• $O(o \mid a, s')$ is the observation model, specified by a CPT per agent.

• $\mathcal{R} = \{R^1, \dots, R^\rho\}$ is a set of local reward functions.

• $b^0$ is the (factored) initial state distribution.

Each local reward function $R^l$ has a state factor scope $\mathcal{X}(l) \subseteq \mathcal{X}$ and agent scope $\mathcal{D}(l) \subseteq \mathcal{D}$ over which it is defined: $R^l(x_{\mathcal{X}(l)}, a_{\mathcal{D}(l)}, x'_{\mathcal{X}(l)}) \in \mathbb{R}$. These local reward functions form the global immediate reward function via addition. We slightly abuse notation and overload $l$ to denote both an index into the set of reward functions, as well as the corresponding scopes:

$$ R(s, a, s') = \sum_{l \in \mathcal{R}} R^l(x_l, a_l, x'_l). $$

Every Dec-POMDP can be converted to a factored Dec-POMDP, but the additional structure that a factored model specifies is most useful when the problem is weakly coupled, meaning that there is sufficient conditional independence in the 2DBN and that the scopes of the reward functions are small.
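As a rough illustration of the additive reward structure, the following Python sketch sums local rewards over their scopes. The function names, the 0-based indexing, and the specific FFG-style reward (minus the next-stage fire level of each house) are assumptions made for the example, not part of the model definition.

```python
def global_reward(local_rewards, s, a, s_next):
    """Sum local rewards R^l(x_l, a_l, x'_l) over all reward components l.

    local_rewards: list of (factor_scope, agent_scope, function) triples.
    s, s_next: tuples with one value per state factor; a: one action per agent.
    """
    total = 0.0
    for factor_scope, agent_scope, r in local_rewards:
        x_l = tuple(s[k] for k in factor_scope)
        a_l = tuple(a[i] for i in agent_scope)
        x_l_next = tuple(s_next[k] for k in factor_scope)
        total += r(x_l, a_l, x_l_next)
    return total


def make_ffg_rewards(num_agents):
    """Hypothetical FFG-style rewards: one component per house (assumption)."""
    rewards = []
    for house in range(num_agents + 1):
        # Agents adjacent to this house; this toy reward ignores their actions.
        agents = [i for i in (house - 1, house) if 0 <= i < num_agents]
        rewards.append(([house], agents, lambda x, a, x_next: -float(x_next[0])))
    return rewards
```

Each component only touches one house and at most two agents, which is the kind of small scope that makes the factored representation useful.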

For instance, Fig. 1 shows the FireFightingGraph (FFG) problem [27], which we adopt as a running example. This problem defines a set of $|\mathcal{D}| + 1$ houses, each with a particular ‘fire level’ indicating if the house is burning and with what intensity. Each agent can fight fire at the house to its left or right, making observations of flames (or no flames) at the visited house. Each house has a local reward function associated with it, which depends on the next-stage fire level,² as illustrated in Fig. 2 (left), which shows the 2DBN for a 4-agent instantiation of FFG. The figure shows that the connections are local but there is no transition independence [3] or value factorization [16, 42]: all houses and agents are connected such that, over time, actions of each agent can influence the entire system. While FFG is a stylized example, such locally-connected systems can be found in applications such as traffic control [47] or communication networks [29, 12, 18].

2 FFG has rewards of the form $R^l(x'_l)$, but we support $R^l(x_l, a_l, x'_l)$ in general.

Figure 2: Left: A 2-agent sub-problem within 4-agent FFG. Right: the corresponding IASP.


This paper focuses on problems with a finite horizon $h$ such that $t = 0, \dots, h-1$. A policy $\pi_i$ for an agent $i$ specifies an action for each observation history $\vec{o}_i^{\,t} = (o_i^1, \dots, o_i^t)$. The task of planning for a factored Dec-POMDP entails finding a joint policy $\pi = \langle \pi_1, \dots, \pi_{|\mathcal{D}|} \rangle$ with maximum value, i.e., expected sum of rewards:

$$ V(\pi) \triangleq \mathbf{E}\left[ \sum_{t=0}^{h-1} R(s, a, s') \;\middle|\; b^0, \pi \right]. $$

Such an optimal joint policy is denoted $\pi^*$.³

In recent years, a number of methods have been proposed to find approximate solutions for factored Dec-POMDPs with many agents [30, 16, 40, 27, 47], but none of these methods are able to give guarantees with respect to the solution quality (i.e., they are heuristic methods), leaving the user unable to confidently gauge how well these methods perform on their problems. This is a principled problem; even finding an ε-approximate solution is NEXP-complete [31], which implies that general and efficient approximation schemes are unlikely to be found. In this paper, we propose a way forward by trying to find instance-specific upper bounds in order to provide information about the solution quality offered by heuristic methods.


3. SUB-PROBLEMS AND INFLUENCES 

The overall approach that we take is to divide the problem into sub-problems (defined here), compute overestimations of the achievable value for each of these sub-problems (discussed in Section 4), and combine those into a global upper bound (Section 5).

3 We omit the ‘*’ on values; all values are assumed to be optimal with respect to their given arguments.


3.1 Sub-Problems (SPs) 

The notion of a sub-problem generalizes the concept of a local-form model (LFM) [28] to multiple agents and reward components. We give a relatively concise description of this formalization; for more details, please see [28].

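Definition 2. A sub-problem (SP) $\mathcal{M}_c$ of a factored Dec-POMDP $\mathcal{M}$ is a tuple $\mathcal{M}_c = \langle \mathcal{M}, \mathcal{D}', \mathcal{X}', \mathcal{R}' \rangle$, where $\mathcal{D}' \subseteq \mathcal{D}$, $\mathcal{X}' \subseteq \mathcal{X}$, $\mathcal{R}' \subseteq \mathcal{R}$ denote subsets of agents, state factors and local reward functions.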


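An SP inherits many features from $\mathcal{M}$: we can define local states $x_c \in \bigotimes_{X \in \mathcal{X}'} \mathrm{Dom}(X)$, and the subsets $\mathcal{D}', \mathcal{X}', \mathcal{R}'$ induce local joint actions $\mathcal{A}_c = \bigotimes_{i \in \mathcal{D}'} \mathcal{A}_i$, observations $\mathcal{O}_c = \bigotimes_{i \in \mathcal{D}'} \mathcal{O}_i$, and rewards

$$ R_c(x_c, a_c, x'_c) = \sum_{l \in \mathcal{R}'} R^l(x_l, a_l, x'_l). \tag{1} $$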

However, this is generally not enough to end up with a fully specified, but smaller, factored Dec-POMDP. This is illustrated in Fig. 2 (left), which shows the 2DBN for a sub-problem of FFG involving two agents and three houses (dependence of observations $o_i$ on actions $a_i$ is not displayed). The figure shows that state factors $X \in \mathcal{X}'$ (in this case $X^i$ and $X^{i+2}$) can be the target of arrows pointing into the sub-problem from the non-modeled (dashed) part. We refer to such state factors as non-locally affected factors (NLAFs) and denote them $xn_c^k$, where $c$ indexes the SP and $k$ indexes the factor. The other state factors in $\mathcal{X}'$ are referred to as only-locally affected factors (OLAFs) $xl_c^k$. The figure clearly shows that the transition probabilities are not well-defined since the NLAFs depend on the sources of the highlighted influence links. We refer to these sources as influence sources $u_c^{t+1} = \langle y_u^t, a_u^t \rangle$ (in this case $y_u^t = \langle X^{i-1}, X^{i+3} \rangle$ and $a_u^t = \langle a_{i-1}^t, a_{i+2}^t \rangle$). This means that an SP $c$ has an underspecified transition model: $T_c(x_c^{t+1} \mid x_c^t, a_c^t, u_c^{t+1})$.
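The distinction between NLAFs and OLAFs can be read off the 2DBN mechanically. The sketch below is a minimal illustration: it assumes a hypothetical encoding of the FFG dependency structure (0-based indices, each house depending on itself, its neighboring houses, and the two adjacent agents), which is an assumption made for this example rather than the paper's exact model.

```python
def ffg_parents(num_agents):
    """Hypothetical 2DBN parents for FFG: house k depends on houses k-1, k, k+1
    and on the actions of agents k-1 and k (where they exist)."""
    num_houses = num_agents + 1
    parents = {}
    for k in range(num_houses):
        houses = {j for j in (k - 1, k, k + 1) if 0 <= j < num_houses}
        agents = {i for i in (k - 1, k) if 0 <= i < num_agents}
        parents[k] = (houses, agents)
    return parents


def classify_factors(parents, sp_factors, sp_agents):
    """Return (NLAFs, OLAFs, influence sources) for a sub-problem.

    A factor is an NLAF if any of its parents lies outside the sub-problem;
    the external parents are collected as influence sources (factors, agents).
    """
    nlafs, olafs = [], []
    source_factors, source_agents = set(), set()
    for k in sorted(sp_factors):
        factor_parents, agent_parents = parents[k]
        ext_factors = set(factor_parents) - set(sp_factors)
        ext_agents = set(agent_parents) - set(sp_agents)
        if ext_factors or ext_agents:
            nlafs.append(k)
            source_factors |= ext_factors
            source_agents |= ext_agents
        else:
            olafs.append(k)
    return nlafs, olafs, (source_factors, source_agents)
```

For the 2-agent sub-problem of 4-agent FFG (agents 1 and 2, houses 1 to 3 in this 0-based encoding), the two outer houses come out as NLAFs and the middle house as the only OLAF, mirroring the situation described for Fig. 2 (left).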

3.2 Structural Assumptions 

In the most general form, the observation and reward model could also be underspecified. In order to simplify the exposition, we make two assumptions on the structure of an SP:

1. For all included agents $i \in \mathcal{D}'$, the state factors that can influence its observations (i.e., ancestors of $o_i$ in the 2DBN) are included in $\mathcal{M}_c$.

2. For all included reward components $R^l \in \mathcal{R}'$, the state factors and actions that influence $R^l$ are included in $\mathcal{M}_c$.

That is, we assume that SPs exhibit generalized forms of observation independence,

$$ O_c(o_c \mid a_c, x'_c) \triangleq \Pr(o_c \mid a_c, x'_c) = \Pr(o_c \mid a, s'), $$

and reward independence (cf. (1)). These are more general notions of observation and reward independence than used in previous work on TOI-Dec-MDPs [3] and ND-POMDPs [22], since we allow overlap on state factors that can be influenced by the agents themselves.⁴

Crucially, however, we do not assume any form of transition independence (for instance, the sets $\mathcal{X}'$ of SPs can overlap), nor do we assume any of the transition-decoupling (i.e., TD-POMDP [43]) restrictions. That is, we neither restrict which node types can affect ‘private’ nodes; nor do we disallow concurrent interaction effects on ‘mutually modeled’ nodes.

This means that assumptions 1 and 2 (above) that we do make are without loss of generality: it is possible to make any Dec-POMDP problem satisfy them by introducing additional (dummy) state factors.⁵


3.3 Influence-Augmented SPs 

An LFM can be transformed into a so-called influence-augmented local model, which captures the influence of the policies and parts of the environment that are not modeled in the local model [28]. Here we extend this approach to SPs, thus leading to influence-augmented sub-problems (IASPs).

Intuitively, the construction of an IASP consists of two steps: 1) capturing the influence of the non-modeled parts of the problem (given $\pi_{\neq c}$, the policies of non-modeled agents) in an incoming influence point $I_{\to c}(\pi_{\neq c})$, and 2) using this $I_{\to c}$ to create a model with a transformed transition model $\bar{T}_{I_{\to c}}$ and no further dependence on the external problem.

Step (1) can be done as follows: an incoming influence point can be specified as an incoming influence $I_{\to c}^{t+1}$ for each stage: $I_{\to c} = (I_{\to c}^{1}, \dots, I_{\to c}^{h})$. Each such $I_{\to c}^{t+1}$ corresponds to the influence that the SP experiences at stage $t+1$, and thus specifies the conditional probability distribution of the influence sources $u_c^{t+1} = \langle y_u^t, a_u^t \rangle$. That is, assuming that the influencing agents use deterministic policies $\pi_u = \langle \pi_i \rangle_{i \in u}$ that map observation histories to actions, $I_{\to c}^{t+1}$ is the conditional probability distribution given by

$$ I(u_c^{t+1} \mid D_c^{t+1}) = \sum_{\vec{o}_u^{\,t}} \mathbf{1}_{\{a_u^t = \pi_u(\vec{o}_u^{\,t})\}} \Pr(y_u^t, \vec{o}_u^{\,t} \mid D_c^{t+1}, b^0, \pi_{\neq c}), $$

where $\mathbf{1}_{\{\cdot\}}$ is the Kronecker delta function, and $D_c^{t+1}$ the d-separating set for $I_{\to c}^{t+1}$: the history of a subset of all the modeled variables that d-separates the modeled variables from the non-modeled ones.⁶

4 Previous work only allowed ‘external’ or ‘unaffectable’ state factors to affect the observations or rewards of multiple components.

5 In contrast, TOI-Dec-MDPs and ND-POMDPs impose both transition and observation independence, thereby restricting consideration to a proper subclass of those considered here.

6 $D_c^{t+1}$ is defined such that $\Pr(y_u^t, \vec{o}_u^{\,t} \mid D_c^{t+1}, b^0, \pi_{\neq c}, C^t) = \Pr(y_u^t, \vec{o}_u^{\,t} \mid D_c^{t+1}, b^0, \pi_{\neq c})$; see [28] for details.


Step (2) involves replacing the CPTs for all the NLAFs by the CPTs induced by $I_{\to c}$.

Definition 3. Let $xn_c^{k,t+1}$ be an NLAF (with index $k$), and $u_c^{t+1}$ (the instantiation of) the corresponding influence sources. Given the influence $I_{\to c}^{t+1}(\pi_{\neq c})$, and its d-separating set $D_c^{t+1}$, we define the induced CPT for $xn_c^{k,t+1}$ as the CPT that specifies probabilities:

$$ p_{I_{\to c}^{t+1}}(xn_c^{k,t+1} \mid x_c^t, D_c^{t+1}, a_c^t) = \sum_{u_c^{t+1}} \Pr(xn_c^{k,t+1} \mid x_c^t, a_c^t, u_c^{t+1}) \, I(u_c^{t+1} \mid D_c^{t+1}). \tag{2} $$
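Equation (2) is a plain marginalization over the influence sources. The following Python sketch shows the computation for one fixed local state, local action and d-separating set; the dictionary layout and the toy numbers are assumptions chosen for illustration.

```python
def induced_cpt(nlaf_cpt, influence):
    """Induced CPT of an NLAF for fixed (x_c, a_c, D_c), following Eq. (2).

    nlaf_cpt[u][x_next] = Pr(xn' = x_next | x_c, a_c, u)
    influence[u]        = I(u | D_c)
    Returns a dict mapping x_next to its induced probability.
    """
    out = {}
    for u, p_u in influence.items():
        for x_next, p in nlaf_cpt[u].items():
            out[x_next] = out.get(x_next, 0.0) + p * p_u
    return out


# Toy numbers (assumption): the NLAF is a fire level in {0, 1}; the influence
# source says whether a neighboring agent helps at this house.
toy_cpt = {"help": {0: 0.8, 1: 0.2}, "no_help": {0: 0.3, 1: 0.7}}
toy_influence = {"help": 0.5, "no_help": 0.5}
toy_induced = induced_cpt(toy_cpt, toy_influence)
```

The result is again a proper distribution over the next-stage value of the NLAF, now conditioned only on quantities that are local to the sub-problem.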

Finally, we can define the IASP.

Definition 4. An influence-augmented SP (IASP) $\mathcal{M}_c^{IA} = \langle \mathcal{M}_c, I_{\to c} \rangle$ for an SP $\mathcal{M}_c = \langle \mathcal{M}, \mathcal{D}', \mathcal{X}', \mathcal{R}' \rangle$ is a factored Dec-POMDP with the following components:

• The agents (implying the actions and observations) from the respective sub-problem participate: $\bar{\mathcal{D}} = \mathcal{D}'$.

• The set of state factors is $\bar{\mathcal{X}} = \mathcal{X}' \cup \{D_c\}$ such that states $\bar{x}_c^t = \langle x_c^t, D_c^{t+1} \rangle$ specify a local state of the SP, as well as the d-separating set $D_c^{t+1}$ for the next-stage influences.

• Transitions are specified as follows: for all OLAFs we take the CPTs from the factored Dec-POMDP $\mathcal{M}$, but for all NLAFs we take the induced CPTs, leading to an influence-augmented transition model which is the product of CPTs of OLAFs and NLAFs:

$$ \bar{T}_{I_{\to c}}(x_c^{t+1} \mid \langle x_c^t, D_c^{t+1} \rangle, a_c^t) = \Pr(xl_c^{t+1} \mid x_c^t, a_c^t) \sum_{u_c^{t+1} = \langle y_u^t, a_u^t \rangle} \Pr(xn_c^{t+1} \mid x_c^t, a_c^t, u_c^{t+1}) \, I(u_c^{t+1} \mid D_c^{t+1}). \tag{3} $$

(Note that $\langle x_c^t, a_c^t, x_c^{t+1} \rangle$ and $D_c^{t+1}$ together uniquely specify $D_c^{t+2}$.)

• The observation model $\bar{O}$ follows directly from $O$ (from $\mathcal{M}$).

• The reward is identical to that of the SP: $\bar{\mathcal{R}} = \mathcal{R}'$.

Fig. 2 (right) illustrates the IASP for FFG. It shows how the d-separating set acts as a parent for all NLAFs, thus replacing the dependence on the external part of the problem.

We write $V_c(\pi)$ for the value that would be realized for the reward components modeled in sub-problem $c$, under a given joint policy $\pi$:

$$ V_c(\pi) \triangleq \mathbf{E}\left[ \sum_{t=0}^{h-1} R_c(s, a, s') \;\middle|\; b^0, \pi \right]. $$

As one can derive, given the policies of other agents $\pi_{\neq c}$, $V_c(I_{\to c}(\pi_{\neq c}))$, the value of the optimal solution of an IASP constructed for the influence corresponding to $\pi_{\neq c}$, is equal to the best-response value:

$$ V_c^{BR}(\pi_{\neq c}) \triangleq \max_{\pi_c} V_c(\pi_c, \pi_{\neq c}) = V_c(I_{\to c}(\pi_{\neq c})). \tag{4} $$

This extends the result in [28] to multiagent SPs.







4. LOCAL UPPER BOUNDS 

In this section we present our main technical contribution: the machinery to compute influence-optimistic upper bounds (IO-UBs) for the value of sub-problems. In order to properly define this class of upper bound, we first define the locally-optimal value:

Definition 5. The locally-optimal value for an SP $c$,

$$ V_c^{LO} \triangleq \max_{\pi_{\neq c}} V_c^{BR}(\pi_{\neq c}) = \max_{\pi_{\neq c}} V_c(I_{\to c}(\pi_{\neq c})), \tag{5} $$

is the local value (considering only the rewards $R_c$) that can be achieved when all agents use a policy selected to optimize this local value. We will denote the maximizing argument by $\pi_{\neq c}^{LO}$.

Note that $V_c^{LO} \geq V_c(\pi^*)$, the value for the rewards $R_c$ under the optimal joint policy $\pi^*$, since $\pi^*$ optimizes the sum of all local reward functions: it might be optimal to sacrifice some reward $R_c$ if it is made up by higher rewards outside of the sub-problem.

$V_c^{LO}$ expresses the maximal value achievable under a feasible incoming influence point; i.e., it is optimistic about the influence, but maintains that the influence is feasible. Computing this value can be difficult, since computing influences and subsequently constructing and optimally solving an IASP can be very expensive in general. However, it turns out computing upper bounds to $V_c^{LO}$ can be done more efficiently, as discussed in Section 4.4.

The IO-UBs that we propose in the remainder of this section upper-bound $V_c^{LO}$ by relaxing the requirement of the incoming influence being feasible, thus allowing for more efficient computation. We present three approaches that each overestimate the value by being optimistic with respect to the assumed influence, but that differ in the additional assumptions that they make.

4.1 A Q-MMDP Approach

The first approach we consider is called influence-optimistic Q-MMDP (IO-Q-MMDP). Like all the heuristics we introduce, it assumes that the considered SP will receive the most optimistic (possibly infeasible) influence. In addition, it assumes that the SP is fully observable such that it reduces to a local multiagent MDP (MMDP) [5]. In other words, this approach resembles Q-MMDP [35, 25], but is applied to an SP, and performs an influence-optimistic estimation of value.⁷ IO-Q-MMDP makes, in addition to influence optimism, another overestimation due to its assumption of full observability. While this negatively affects the tightness of the upper bound, it has the advantage that its computational complexity is relatively low.

Formally, we can describe IO-Q-MMDP as follows. In the first phase, we apply dynamic programming to compute the action-values for all local states:

$$ Q(x_c^t, a_c^t) = \max_{u_c^{t+1}} \sum_{x_c^{t+1}} \Pr(xl_c^{t+1} \mid x_c^t, a_c^t) \Pr(xn_c^{t+1} \mid x_c^t, a_c^t, u_c^{t+1}) \left[ R_c(x_c^t, a_c^t, x_c^{t+1}) + \max_{a_c^{t+1}} Q(x_c^{t+1}, a_c^{t+1}) \right]. \tag{6} $$

Comparing this equation to (3), it is clear that this equation is optimistic with respect to the influence: it selects the sources $u_c^{t+1}$ in order to select the most beneficial transition probabilities. In the second phase, we use these values to compute an upper bound:

$$ \hat{V}_c^{M} \triangleq \max_{a_c} \sum_{x_c} b^0(x_c) \, Q(x_c, a_c). $$

7 What we have termed “Q-MMDP” has been referred to in past work as “Q-MDP”; we add the extra M to emphasize the presence of multiple agents.

This procedure is guaranteed to yield an upper bound to the locally-optimal value for the SP.

Theorem 1. IO-Q-MMDP yields an upper bound to the locally-optimal value: $V_c^{LO} \leq \hat{V}_c^{M}$.


Proof. An inductive argument easily establishes that, due to the maximization it performs, (6) is at least as great as the Q-MMDP value (for all $x_c^t, D_c^{t+1}, a_c^t$) of any feasible influence, given by:

$$ Q_c^{MMDP}(\langle x_c^t, D_c^{t+1} \rangle, a_c^t) = \sum_{x_c^{t+1}} \bar{T}_{I_{\to c}}(x_c^{t+1} \mid \langle x_c^t, D_c^{t+1} \rangle, a_c^t) \left[ R_c(x_c^t, a_c^t, x_c^{t+1}) + \max_{a_c^{t+1}} Q(\langle x_c^{t+1}, D_c^{t+2} \rangle, a_c^{t+1}) \right]. \tag{7} $$

Therefore the value computed in (6) is at least as great as the Q-MMDP value (7) induced by $\pi_{\neq c}^{LO}$ (the maximizing argument of (5)), for all $x_c^t, D_c^{t+1}, a_c^t$. This directly implies

$$ \hat{V}_c^{M} \geq V_c^{MMDP}(I_{\to c}(\pi_{\neq c}^{LO})). $$

Moreover, it is well known that, for any Dec-POMDP, the Q-MMDP value is an upper bound to its value [35], such that

$$ V_c^{MMDP}(I_{\to c}(\pi_{\neq c}^{LO})) \geq V_c(I_{\to c}(\pi_{\neq c}^{LO})). $$

We can conclude that $\hat{V}_c^{M}$ is an upper bound to the Dec-POMDP value of the IASP induced by $\pi_{\neq c}^{LO}$:

$$ \hat{V}_c^{M} \geq V_c(I_{\to c}(\pi_{\neq c}^{LO})) = \max_{\pi_{\neq c}} V_c(I_{\to c}(\pi_{\neq c})) = V_c^{LO}, $$

with the identities given by (5), thus proving the theorem. □

The upshot of (6) is that there are no dependencies on d-separating sets and incoming influences anymore: the IO assumption effectively eliminates these dependencies. As a result, there is no need to actually construct the IASPs (that potentially have a very large state space) if all we are interested in is an upper bound.
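The two phases of IO-Q-MMDP can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the local model is given as a hypothetical function `trans(x, a, u)` returning the combined OLAF and NLAF transition distribution, the maximization over influence sources is taken per state-action pair as in (6), and the toy single-house model is invented for the example.

```python
def io_q_mmdp(states, actions, sources, trans, reward, horizon):
    """Phase 1: influence-optimistic dynamic programming, cf. Eq. (6).

    trans(x, a, u) -> dict {x_next: probability}; reward(x, a, x_next) -> float.
    Returns the stage-0 action values Q(x, a).
    """
    q = {(x, a): 0.0 for x in states for a in actions}  # no reward after the horizon
    for _ in range(horizon):
        v_next = {x: max(q[(x, a)] for a in actions) for x in states}
        q = {
            (x, a): max(
                sum(p * (reward(x, a, x2) + v_next[x2])
                    for x2, p in trans(x, a, u).items())
                for u in sources
            )
            for x in states for a in actions
        }
    return q


def io_upper_bound(b0, actions, q0):
    """Phase 2: maximize the expected stage-0 action value under b0."""
    return max(sum(b0[x] * q0[(x, a)] for x in b0) for a in actions)


# Toy model (assumption): one house with fire level 0 or 1; the chance of
# ending up at level 0 grows if the local agent fights and if a neighbor helps.
def toy_trans(x, a, u):
    p0 = 0.5 * (a == "fight") + 0.5 * (u == "help")
    return {0: p0, 1: 1.0 - p0}


def toy_reward(x, a, x_next):
    return -float(x_next)


STATES, ACTIONS, B0 = [0, 1], ["fight", "idle"], {0: 0.5, 1: 0.5}
ub_io = io_upper_bound(
    B0, ACTIONS, io_q_mmdp(STATES, ACTIONS, ["help", "none"], toy_trans, toy_reward, 2))
ub_fixed = io_upper_bound(
    B0, ACTIONS, io_q_mmdp(STATES, ACTIONS, ["none"], toy_trans, toy_reward, 2))
```

Restricting `sources` to a single value mimics a fixed influence, so comparing `ub_io` with `ub_fixed` shows the optimism at work: the optimistic bound can only be at least as large.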




4.2 A Q-MPOMDP Approach 

The IO-Q-MMDP approach of the previous section introduces overestimations through influence-optimism as well as by assuming full observability. Here we tighten the upper bound by weakening the second assumption. In particular, we propose an upper bound based on the underlying multiagent POMDP (MPOMDP).

An MPOMDP [21] is partially observable, but assumes that the agents can freely communicate their observations, such that the problem reduces to a special type of centralized model in which the decision maker (representing the entire team of agents) takes joint actions, and receives joint observations. As a result, the optimal value for an MPOMDP is analogous to that of a POMDP:

$$ Q(b^t, a^t) = R(b^t, a^t) + \sum_{o^{t+1}} \Pr(o^{t+1} \mid b^t, a^t) \, V(b^{t+1}), \tag{8} $$

where $b^{t+1}$ is the joint belief resulting from performing Bayesian updating of $b^t$ given $a^t$ and $o^{t+1}$.

Using the value function of the MPOMDP solution as a heuristic (i.e., an upper bound) for the value function of a Dec-POMDP is a technique referred to as Q-MPOMDP [32, 25]. Here we combine this approach with optimistic assumptions on the influences, leading to influence-optimistic Q-MPOMDP (IO-Q-MPOMDP).

In case that the influence on an SP is fully specified, (8) can be readily applied to the IASP. However, we want to deal with the case where this influence is not specified. The basic, conceptually simple, idea is to move from the influence-optimistic MMDP-based upper bounding scheme of Section 4.1 to one based on MPOMDPs. However, it presents a technical difficulty, since it is not directly obvious how to extend (6) to deal with partial observability. In particular, in the MPOMDP case as given by (8), the state $x_c^t$ is replaced by a belief over such local states and the influence sources $u_c^{t+1}$ affect the value by both manipulating the transition and observation probabilities, as well as the resulting beliefs.

To overcome these difficulties, we propose a formulation that is not directly based on (8), but that makes use of ‘back-projected value vectors’. That is, it is possible to rewrite the optimal MPOMDP value function as⁸

$$ Q(b^t, a^t) = b^t \cdot r_a + \gamma \sum_{o^{t+1}} \max_{\nu_{ao} \in \mathcal{V}_{ao}} b^t \cdot \nu_{ao}, \tag{9} $$

where $\cdot$ denotes inner product and where $\nu_{ao} \in \mathcal{V}_{ao}$ are the back-projections of value vectors $\nu \in \mathcal{V}^{t+1}$:

$$ \nu_{ao}(s^t) = \sum_{s^{t+1}} O(o^{t+1} \mid a^t, s^{t+1}) \, T(s^{t+1} \mid s^t, a^t) \, \nu(s^{t+1}). \tag{10} $$

(Please see, e.g., [34, 33] for more details.)

The key insight that enables carrying influence-optimism to the MPOMDP case is that this back-projected form (10) does allow us to take the maximum with respect to unspecified influences. That is, we define the influence-optimistic back-projection as:

$$ \nu_{IO}^{ao}(x_c^t) \triangleq \max_{u_c^{t+1}} \sum_{x_c^{t+1}} O(o_c^{t+1} \mid a_c^t, x_c^{t+1}) \Pr(xn_c^{t+1} \mid x_c^t, a_c^t, u_c^{t+1}) \Pr(xl_c^{t+1} \mid x_c^t, a_c^t) \, \nu_{IO}(x_c^{t+1}). \tag{11} $$

8 In this section and the next, we will restrict ourselves to rewards of the form $R(s,a)$ to reduce the notational burden, but the presented formulas can be extended to deal with $R(s,a,s')$ formulations in a straightforward way.

Since this equation does not depend in any way on the d-separating sets and influence, we can completely avoid generating large IASPs. As for implementation, many POMDP solution methods [6, 13] are based on such back-projections and therefore can be easily modified; all that is required is to substitute the back-projections by their modified form (11). When combined with an exact POMDP solver, such influence-optimistic back-ups will lead to an upper bound $\hat{V}_c^{P}$, to which we refer as IO-Q-MPOMDP, on the locally-optimal value.
To formally prove this claim, we will need to discriminate a few different types of value, and associated constructs. Let us define:

• $b^t_{I_{\to c}}(\langle x_c^t, D_c^{t+1} \rangle)$, an MPOMDP belief for the IASP induced by an arbitrary influence $I_{\to c}$,

• $V^t_{I_{\to c}}$, the optimal value function when the IASP is solved as an MPOMDP, such that $V^t_{I_{\to c}}(b^t_{I_{\to c}})$ is the value of $b^t_{I_{\to c}}$ and $V_c^{MPOMDP}(I_{\to c}) \triangleq V^0_{I_{\to c}}(b^0_{I_{\to c}})$. $V^t_{I_{\to c}}$ is represented using vectors $\nu \in \mathcal{V}$.

• $b^t_{IO}(x_c^t)$, an arbitrary distribution over $x_c^t$ that can be thought of as the MPOMDP belief for the ‘influence optimistic SP’.⁹

• $V^t_{IO,c}$, the value function computed by an (exact) influence-optimistic MPOMDP method, that assigns a value $V^t_{IO,c}(b^t_{IO})$ to any $b^t_{IO}$. $V^t_{IO,c}$ is represented using vectors $\nu_{IO} \in \mathcal{V}_{IO}$. The IO-Q-MPOMDP upper bound is defined by plugging in the true initial state distribution $b^0$, restricted to factors in $c$: $\hat{V}_c^{P} \triangleq V^0_{IO,c}(b^0_{IO})$.

First we establish a relation between the different vectors representing $V^t_{I_{\to c}}$ and $V^t_{IO,c}$.

Lemma 1. Let $\pi_c^{t:h-1}$ be a $(h-t)$-steps-to-go policy. Let $\nu \in \mathcal{V}$ and $\nu_{IO} \in \mathcal{V}_{IO}$ be the vectors induced by $\pi_c^{t:h-1}$ under regular MPOMDP back-projections (for some $I_{\to c}$), and under IO back-projections respectively. Then

$$ \forall_{x_c^t} \quad \max_{D_c^{t+1}} \nu(\langle x_c^t, D_c^{t+1} \rangle) \leq \nu_{IO}(x_c^t). $$

Proof. The proof is listed in Appendix A. □

9 For instance, in the case of FFG from Fig. 2 we can imagine an SP that encodes optimistic assumptions by assuming that the neighboring agents always will fight fire at houses $i$ and $i+2$. Even though it may not be possible to define such a model for all problems (the optimistic influence could depend on the local (belief) state in intricate ways), this gives some interpretation to $b^t_{IO}(x_c^t)$. Additionally, we exploit the fact that construction of an optimistic SP model is possible for the considered domains in Section 6.






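This lemma provides a strong result on the relation of values computed under regular MPOMDP backups versus influence-optimistic ones. It allows us to establish the following theorem:

Theorem 2. For an SP $c$, for all $I_{\to c}$,

$$ V^t_{I_{\to c}}(b^t_{I_{\to c}}) \leq V^t_{IO,c}(b^t_{IO}), $$

provided that $b^t_{IO}$ coincides with the marginals of $b^t_{I_{\to c}}$:

$$ \text{(A1)} \qquad \forall_{x_c^t} \quad \sum_{D_c^{t+1}} b^t_{I_{\to c}}(\langle x_c^t, D_c^{t+1} \rangle) = b^t_{IO}(x_c^t). $$

Proof. We start with the left hand side:

$$
\begin{aligned}
V^t_{I_{\to c}}(b^t_{I_{\to c}}) &= \max_{\nu \in \mathcal{V}} \; b^t_{I_{\to c}} \cdot \nu \\
&= \max_{\nu \in \mathcal{V}} \sum_{\langle x_c^t, D_c^{t+1} \rangle} b^t_{I_{\to c}}(\langle x_c^t, D_c^{t+1} \rangle) \, \nu(\langle x_c^t, D_c^{t+1} \rangle) \\
&\leq \max_{\nu \in \mathcal{V}} \sum_{x_c^t} \sum_{D_c^{t+1}} b^t_{I_{\to c}}(\langle x_c^t, D_c^{t+1} \rangle) \max_{D'} \nu(\langle x_c^t, D' \rangle) \\
\{\text{A1}\} \quad &= \max_{\nu \in \mathcal{V}} \sum_{x_c^t} b^t_{IO}(x_c^t) \max_{D_c^{t+1}} \nu(\langle x_c^t, D_c^{t+1} \rangle) \\
\{\text{Lemma 1}\} \quad &\leq \max_{\nu_{IO} \in \mathcal{V}_{IO}} \sum_{x_c^t} b^t_{IO}(x_c^t) \, \nu_{IO}(x_c^t) \\
&= \max_{\nu_{IO} \in \mathcal{V}_{IO}} \; b^t_{IO} \cdot \nu_{IO} \\
&= V^t_{IO,c}(b^t_{IO}),
\end{aligned}
$$

thus proving the theorem. □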

Corollary 1. IO-Q-MPOMDP yields an upper bound to the locally-optimal value: $V_c^{LO} \leq \hat{V}_c^{P}$.

Proof. The initial beliefs are defined such that the above condition (A1) holds. That is:

$$ \forall_{x_c^0} \quad \sum_{D_c^{1}} b^0_{I_{\to c}}(\langle x_c^0, D_c^{1} \rangle) = b^0_{IO}(x_c^0) = b^0(x_c^0). $$

Therefore, application of Theorem 2 to the initial belief yields:

$$ \forall_{I_{\to c}} \quad \hat{V}_c^{P} = V^0_{IO,c}(b^0_{IO}) \geq V^0_{I_{\to c}}(b^0_{I_{\to c}}) = V_c^{MPOMDP}(I_{\to c}). $$

It is well-known that the MPOMDP value is an upper bound to the Dec-POMDP value [25], such that

$$ V_c^{MPOMDP}(I_{\to c}(\pi_{\neq c}^{LO})) \geq V_c(I_{\to c}(\pi_{\neq c}^{LO})), $$

and we can immediately conclude that

$$ \hat{V}_c^{P} \geq V_c(I_{\to c}(\pi_{\neq c}^{LO})) = \max_{\pi_{\neq c}} V_c(I_{\to c}(\pi_{\neq c})) = V_c^{LO}, $$

with the identities given by (5), proving the result. □


4.3 A Dec-POMDP Approach 

The previous approaches compute upper bounds by,
apart from the IO assumption, additionally making optimistic
assumptions on observability or communication
capabilities. Here we present a general method for computing
Dec-POMDP-based upper bounds that, other than
the optimistic assumptions about neighboring SPs, makes
no additional assumptions and thus provides the tightest
bounds out of the three that we propose. This approach
builds on the recent insight [17, 8, 24] that a Dec-POMDP
can be converted to a special case of POMDP
(for an overview of this reduction, see [23]); we can
thereby leverage the influence-optimistic back-projection
(11) to compute an IO-UB that we refer to as IO-Q-Dec-POMDP.

As in the previous two sub-sections, we will leverage
optimism with respect to an influence-augmented model
that we will never need to construct. In particular, as
explained in Section 3, we can convert an SP $M_c$ to an
IASP $M^{IA}_c$ given an influence $I_{\to c}$. Since such an IASP
is a Dec-POMDP, we can convert it to a special case of
a POMDP:


Definition 6. A plan-time influence-augmented sub-problem
is a tuple $M^{PT\text{-}IA}(M_c, I_{\to c}) = \langle S, A, T_{I\to c}, R, \mathcal{O}, O, h, b^0\rangle$, where:

• $S$ is the set of states $s^t = \langle \bar{x}^t_c, \vec{o}^{\,t}_c\rangle = \langle x^t_c, D^{t+1}_c, \vec{o}^{\,t}_c\rangle$.

• $A$ is the set of actions; each $a^t$ corresponds to a local
joint decision rule $\delta^t_c$ in the SP.

• $T_{I\to c}(s^{t+1} \mid s^t, a^t)$ is the transition function defined
below.

• $R(s^t, a^t) = R_c(x^t_c, \delta^t_c(\vec{o}^{\,t}_c))$.

• $\mathcal{O} = \{\text{NULL}\}$.

• $O$ specifies that observation NULL is received with
probability 1 (irrespective of the state and action).

• The horizon $h$ is not modified.

• $b^0$ is the initial state distribution; there is only
one $\vec{o}^{\,0}_c$ (i.e., the empty joint observation history).

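The transition function specifies:

$$T_{I\to c}\big(\langle x^{t+1}_c, D^{t+2}_c, \vec{o}^{\,t+1}_c\rangle \mid \langle x^t_c, D^{t+1}_c, \vec{o}^{\,t}_c\rangle, \delta^t_c\big)
= T_{I\to c}\big(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, \delta^t_c(\vec{o}^{\,t}_c)\big)\, O\big(o^{t+1}_c \mid \delta^t_c(\vec{o}^{\,t}_c), x^{t+1}_c\big)$$

if $\vec{o}^{\,t+1}_c = (\vec{o}^{\,t}_c, o^{t+1}_c)$ and 0 otherwise. In this equation
$T_{I\to c}$ and $O$ are given by the IASP (cf. Section 3; see footnote 10).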


This reduction shows that it is possible to compute
$V^*_c(I_{\to c}(\pi_{\neq c}))$, the optimal value for an SP given an
influence point $I_{\to c}$, but the formulation is subject to the
same computational burden as solving a regular IASP:
constructing it is complex due to the inference that needs
to be performed to compute $I_{\to c}$, and subsequently solving
the IASP is complex due to the large number of
augmented states $s^t = \langle x^t_c, D^{t+1}_c, \vec{o}^{\,t}_c\rangle$.

10 Remember that $D^{t+2}_c$ is a function of the specified
quantities: $D^{t+2}_c = d(x^t_c, D^{t+1}_c, \delta^t_c(\vec{o}^{\,t}_c), x^{t+1}_c)$.



Fortunately, here too we can compute an upper bound
to any feasible incoming influence, and thus to $V^{LO}_c$, by
using optimistic backup operations with respect to an
underspecified model, to which we refer as simply the
plan-time SP:

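Definition 7. We define the plan-time sub-problem
$M^{PT}_c$ as an under-specified POMDP $M^{PT}(M_c, \cdot) = \langle S, A, T_{(\cdot)}, R, \mathcal{O}, O, h, b^0\rangle$ with

• states of the form $s^t = \langle x^t_c, \vec{o}^{\,t}_c\rangle$,

• an underspecified transition model

$$T_{(\cdot)}(s^{t+1} \mid s^t, a^t) \triangleq T_c\big(x^{t+1}_c \mid x^t_c, \delta^t_c(\vec{o}^{\,t}_c), u^{t+1}_c\big)\, O_c\big(o^{t+1}_c \mid \delta^t_c(\vec{o}^{\,t}_c), x^{t+1}_c\big),$$

• and $A, R, \mathcal{O}, O, h, b^0$ as above.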

Since this model is a special case of a POMDP, the
theory developed in Section 4.2 applies: we can maintain
a plan-time sufficient statistic $\sigma_{c,t}$ (essentially the ‘belief’
$b$ over augmented states $s^t = \langle x^t_c, \vec{o}^{\,t}_c\rangle$) and we can
write down the value function as in that section. Most importantly,
the IO back-projection (11) also applies, which
means that (similar to the MPOMDP case) we can avoid
ever constructing the full PT-IASP. The IO back-projection
in this case translates to:

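$$\nu^{a}_{IO}(x^t_c, \vec{o}^{\,t}_c) = \max_{u^{t+1}_c} \sum_{x^{t+1}_c} \Pr\big(o^{t+1}_c \mid \delta^t_c(\vec{o}^{\,t}_c), x^{t+1}_c\big)\, \Pr\big(x^{t+1}_n \mid x^t_c, \delta^t_c(\vec{o}^{\,t}_c), u^{t+1}_c\big)\, \Pr\big(x^{t+1}_l \mid x^t_c, \delta^t_c(\vec{o}^{\,t}_c)\big)\, \nu_{IO}(x^{t+1}_c, \vec{o}^{\,t+1}_c). \qquad (12)$$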

Here, we omitted the superscript for the NULL observation.
Also, note that $O(o^{t+1} \mid a^t, x^{t+1})$ in (11) corresponds
to the NULL observation in the PT model,
but since the observation histories are in the states,
$\Pr(o^{t+1}_c \mid \delta^t_c(\vec{o}^{\,t}_c), x^{t+1}_c)$ comes out of the transition model.

Again, given this modified back-projection, the IO-Q-Dec-POMDP
value $V^{D}_c$ can be computed using any exact
POMDP solution method that makes use of vector back-projections;
all that is required is to substitute the
back-projections by their modified form (12).


Corollary 2. IO-Q-Dec-POMDP yields an upper bound
to the locally-optimal value: $V^{LO}_c \le V^{D}_c$.

Proof. Directly by applying Corollary 1 to $M^{PT}_c$. □

4.4 Computational Complexity 

Due to the maximization in (6), (11) and (12), IO
back-projections are more costly than regular (non-IO)
back-projections. Here we analyze the computational
complexity of the proposed algorithms relative to the
regular, non-IO, backups.

We start by comparing the IO-Q-MMDP backup operation
(6) to the regular MMDP backup for an SP that
does not have incoming influences. For such an SP, the
MMDP backup is given by

$$Q(x^t_c, a^t_c) = \sum_{x^{t+1}_c} \Pr(x^{t+1}_c \mid x^t_c, a^t_c) \Big[ R_c(x^t_c, a^t_c, x^{t+1}_c) + \max_{a^{t+1}_c} Q(x^{t+1}_c, a^{t+1}_c) \Big]. \qquad (13)$$

Comparing (6) and (13), we see two differences: 1) the
transition probabilities can be written in one term since
we do not need to discriminate NLAFs from OLAFs, and
2) there is no maximization over influence sources $u^{t+1}_c$.
The first difference does not induce a change in computational
cost, only in notation: in both cases the entire
transition probability is given as the product of next-stage
CPTs. The second difference does induce a change
in computational cost: in (6), in order to select the maximum,
the inner part of the right-hand side needs to be
evaluated for each of the $|u^{t+1}_c|$ instantiations of the
influence sources. That is, the computational complexity of a
Q-MMDP backup (for a particular $(x^t_c, a^t_c)$-pair) is

$$O(|X_c| \times |A_c|),$$

whereas the total computational cost of the IO-Q-MMDP
backup is:

$$O(|u^{t+1}_c| \times |X_c| \times |A_c|).$$

As such, the complexity of each backup is multiplied
by the number of influence source instantiations $|u^{t+1}_c|$.
This means that the overhead of IO-Q-MMDP relative
to Q-MMDP is $O(|u^{t+1}_c|)$.
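
To make the source of this overhead concrete, the following minimal Python sketch contrasts a regular backup in the style of (13) with an influence-optimistic one. It is an illustration under stated assumptions, not the implementation used in the experiments: the names `q_backup`, `io_q_backup`, `T_u` and `random_dist` are hypothetical, the NLAF and OLAF factors are collapsed into a single transition table per influence-source instantiation, and all models are randomly generated.

```python
import random

def q_backup(T, R, Q_next, x, a):
    """Regular backup in the style of (13) for one (x, a) pair.

    T[x][a][y] = Pr(y | x, a), R[x][a][y] is the reward, and
    Q_next[y][b] is the next-stage Q-value. Cost: O(|X| * |A|).
    """
    return sum(
        T[x][a][y] * (R[x][a][y] + max(Q_next[y]))
        for y in range(len(Q_next))
    )

def io_q_backup(T_u, R, Q_next, x, a):
    """Influence-optimistic backup (sketch): T_u[u] is the transition table
    obtained by fixing the influence sources to instantiation u. The regular
    backup is evaluated once per u, so the cost is O(|u| * |X| * |A|)."""
    return max(q_backup(T, R, Q_next, x, a) for T in T_u)

def random_dist(rng, n):
    """Hypothetical helper: a random probability vector of length n."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
n_u, n_x, n_a = 3, 4, 2
T_u = [[[random_dist(rng, n_x) for _ in range(n_a)] for _ in range(n_x)]
       for _ in range(n_u)]
R = [[[-rng.random() for _ in range(n_x)] for _ in range(n_a)]
     for _ in range(n_x)]
Q_next = [[-rng.random() for _ in range(n_a)] for _ in range(n_x)]

# Any actual incoming influence induces a distribution over instantiations u,
# i.e. a mixture of the per-u transition tables.
influence = random_dist(rng, n_u)
T_mix = [[[sum(influence[u] * T_u[u][x][a][y] for u in range(n_u))
           for y in range(n_x)] for a in range(n_a)] for x in range(n_x)]

gaps = [io_q_backup(T_u, R, Q_next, x, a) - q_backup(T_mix, R, Q_next, x, a)
        for x in range(n_x) for a in range(n_a)]
print(min(gaps) >= -1e-12)  # the optimistic backup is never smaller
```

The IO backup calls the regular backup once per instantiation, which is where the factor $|u^{t+1}_c|$ comes from. The final check relies on linearity: a backup under a mixture of transition tables is the same mixture of the per-instantiation backups, and therefore cannot exceed their maximum.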

A similar comparison of the overhead of the IO-Q-MPOMDP
back-projection (11) with respect to the regular
Q-MPOMDP back-projection (10) leads to exactly
the same relative overhead of $O(|u^{t+1}_c|)$. Since IO-Q-Dec-POMDP
makes use of this last result, also in this
case the relative overhead is $O(|u^{t+1}_c|)$.

Concluding, the computational complexity of the methods
we proposed to compute IO-UBs is given by multiplying
the computational complexity of the ‘underlying’
MMDP, MPOMDP or Dec-POMDP method with
$|u^{t+1}_c|$. This means that the relative overhead is equal
for all methods. Since the amount of overhead is a linear
function of $|u^{t+1}_c|$, it is problem dependent: more
densely coupled problems will lead to a higher overhead,
while very loosely coupled problems have a low overhead.

5. GLOBAL UPPER BOUNDS 

We next discuss how the methods to compute local 
upper bounds can be employed in order to compute a 
global upper bound for factored Dec-POMDPs. 

The basic idea is to apply a non-overlapping decomposition
$C$ (i.e., a partitioning) of the reward functions
$\{R^l\}$ of the original factored Dec-POMDP into SPs $c \in C$,
and to compute an IO upper bound $V^{IO}_c$ for each
(which can be any of the three IO-UBs proposed in Section 4).
Our global influence-optimistic upper bound is
then given by:

$$V^{IO} \triangleq \sum_{c\in C} V^{IO}_c. \qquad (14)$$






Figure 3: Illustration of construction of a global 
upper bound for 6-agent FFG using influence- 
optimistic upper bounds for sub-problems. 


Fig. 3 illustrates the construction of a global upper
bound $V^{IO}$ for the 6-agent FFG, showing the original
problem (top row) and two possible decompositions into
SPs. The second row specifies a decomposition into two
SPs, while the third row uses three SPs. The illustration
clearly shows how (in this problem) a decomposition
eliminates certain agents completely and replaces them
with optimistic assumptions: e.g., in the second row,
during the computation of $V^{IO}_c$ for both SPs ($c = 1, 2$)
the assumption is made that agent 3 will always fight
fire in the SP under concern. Effectively we assume that
agent 3 fights fire at both house 3 and house 4 simultaneously
(and hence is represented by a superhero figure).
Fig. 3 also illustrates that, due to the line structure of
FFG, there are two types of SPs: ‘internal’ SPs which
make optimistic assumptions on two sides, and ‘edge’
SPs that are optimistic at just one side.

Finally, we formally prove the correctness of our pro¬ 
posed upper bounding scheme. 


Theorem 3. Let $C$ be a partitioning of the reward function
set $\mathcal{R}$ into sub-problems such that every $R^l$ is represented
in one SP $c$. Then the global IO-UB is in fact
an upper bound to the optimal value: $V^{IO} \ge V(\pi^*)$.

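Proof. Starting from the definition (14), we have

$$\begin{aligned}
V^{IO} = \sum_{c\in C} V^{IO}_c \;&\ge\; \sum_{c\in C} V^{LO}_c && \{\text{Section 4}\} \\
&= \sum_{c\in C} \max_{\pi_{\neq c}} V^{*}_c(\pi_{\neq c}) \\
&= \sum_{c\in C} \max_{\pi_{\neq c}} \max_{\pi_c} V_c(\pi_c, \pi_{\neq c}) \\
&\ge \max_{\pi} \sum_{c\in C} V_c(\pi_c, \pi_{\neq c}) \\
&= V(\pi^*), && \{C \text{ is a partition}\}
\end{aligned}$$

thus proving the result. □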
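
The central step of the proof, that a sum of independently maximized local values cannot be smaller than the maximum of the sum, can be checked by brute force on a toy instance. The sketch below is purely illustrative: the two local value tables `V1` and `V2` and the helper names are made up, each joint policy is abbreviated to a pair of integers, and exact local maximization stands in for the IO-UBs (which can only be larger).

```python
from itertools import product

# Joint policies of a toy two-agent problem: one binary choice per agent.
joint_policies = list(product(range(2), repeat=2))

# Hypothetical local values V_c(pi) of two sub-problems; each depends on the
# whole joint policy because the other agent influences the local rewards.
V1 = {(0, 0): -3.0, (0, 1): -1.0, (1, 0): -2.0, (1, 1): -4.0}
V2 = {(0, 0): -1.0, (0, 1): -5.0, (1, 0): -2.0, (1, 1): -3.0}
local_values = [V1, V2]

def sum_of_local_maxima(local_values, policies):
    """Each sub-problem picks its own best joint policy independently."""
    return sum(max(V[pi] for pi in policies) for V in local_values)

def optimal_global_value(local_values, policies):
    """A single joint policy has to be shared by all sub-problems."""
    return max(sum(V[pi] for V in local_values) for pi in policies)

bound = sum_of_local_maxima(local_values, joint_policies)
optimum = optimal_global_value(local_values, joint_policies)
print(bound, optimum)
```

In this instance the two sub-problems prefer different joint policies, so the decomposed bound is strictly larger than the optimal value; the gap closes only when all sub-problems agree on a maximizing joint policy.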


6. EMPIRICAL EVALUATION 

In order to test the potential impact of the proposed
influence-optimistic upper bounds, we present numerical
results in the context of a number of benchmark problems.
In this evaluation, we focus on the (relative) values
found by these heuristics, as we hope that these will
spark a number of interesting ideas for further research
(such as the notion of ‘influence strength’, its relation
to approximability of factored Dec-POMDPs, and the
key idea that reasonable bounds for very large problems
may be possible). We do not thoroughly investigate
timing results as the analysis of Section 4.4 indicates
that relative timing results follow those of regular (non-IO)
MMDP, MPOMDP and Dec-POMDP methods; see,
e.g., [25] for a comparison of such timing results. However,
in order to provide an overall idea of the run times,
we do provide some indicative running times (see footnote 11).

6.1 Comparison of Different Bounds 

The bounds that we propose are ordered in tightness,
$V^{D}_c \le V^{P}_c \le V^{M}_c$, similar to how regular (non-IO)
Dec-POMDP, Q-MPOMDP, and Q-MMDP values
relate [25]. To get an understanding of how these differences
turn out in practice, Fig. 4 (left) compares the
different upper bounds introduced.

Although the approach described in the paper is general,
in the numerical evaluation here we exploit the
property that the optimistic influences are easily identified
off-line, which allows for the construction of small
‘optimistic Dec-POMDPs’ (respectively MPOMDPs or
MMDPs) without sacrificing bound quality. E.g., in
order to compute the local IO-Q-Dec-POMDP upper
bound for a 3-house FFG ‘edge’ SP, we define a regular
3-house Dec-POMDP where the transition probabilities
for the first house (say $X^i$ in Fig. 2) are modified to
account for the optimistic assumption that another (superhero)
agent fights fire there and that its neighbor is
not burning (i.e., $a^{i-1}$ = right and $X^{i-1}$ = not_burning
in Fig. 2). The values $V^{D}_c$ (resp. $V^{P}_c$, $V^{M}_c$) are computed
by running a state-of-the-art Dec-POMDP solver [24]
(resp. incremental pruning [6], plain dynamic programming)
on such optimistically defined problems.

Fig. 4 (left) shows the values for such ‘edge’ problems.
Missing bars indicate time-outs (>4h). As an indication
of run time, the $|\mathcal{D}'| = 3$, $h = 5$ problem took 2.21s
for IO-Q-MMDP, and 995.28s for IO-Q-Dec-POMDP.
The shown values indicate that $V^{D}_c$, $V^{P}_c$ can be tighter
than $V^{M}_c$ in practice. In most cases, the difference between
IO-Q-MPOMDP and IO-Q-Dec-POMDP is small,
but it could become larger for longer horizons [25].
We performed the same analysis for the Aloha benchmark
[27], and found very similar results.

We also compare the bounds found on the different
types of SPs (internal and edge cases, see Fig. 3)
encountered in FFG ($h = 4$). In addition, Fig. 4 (middle)
also includes, if computable within the allowed time,
values of SPs that are ‘full’ problems (i.e., the regular
optimal Dec-POMDP value for the full FFG instance
with the indicated number of agents). This makes clear
that the optimistic assumption has quite some effect:
being optimistic at one edge more than halves the optimal
cost, and the IO assumption at both edges of the SP
leads to another significant reduction of that cost. This
11 All experiments are run on an Intel Xeon E5-2650L,
32GB system making use of one core only.











Figure 4: Left: Different IO-UBs on ‘edge’ FFG problems. Middle: IO-Q-Dec-POMDP upper bound
on different types (internal, edge, full) of SPs. Right: The impact of the ‘influence strength’.


is to be expected: the optimistic problems assume that
there always will be another agent fighting fire at the
house at an optimistic edge, while the full problem never
has another agent at that same house. When also taking
into account the transition probabilities (two agents
at a house will completely extinguish a fire), it is clear
that the IO assumption should have a high impact on
the local value.

6.2 The Effect of Influence Strength 

Fig. 4 (middle) makes clear that the IO assumption
in FFG is quite strong and leads to a significant overestimation
of the local value compared to the ‘full’ problem.
We say that FFG has a high influence strength. In
fact, this hints at a new dimension of the qualification of
weak coupling [44] that takes into account the variance
in NLAF probabilities (as a function of the change of
value of the influence source) and their impact on the
local value.

As a preliminary investigation of this concept, we de¬ 
vise a modification of FFG where the influence strength 
can be controlled. In particular, we parameterize the 
probability that a fire is extinguished completely when 
2 agents visit the same house, which is set to 1 in the 
original problem definition. Lower values of this proba¬ 
bility mean that optimistically assuming there is another 
agent at a house will lead to less advantage, and thus 
lower influence strength. 

Fig. 4 (right) shows the results of this experiment. It
shows that there is a clear relation between the fire-extinguish
probability when two agents fight fire at a
house, and the ratio between the ‘regular’ (Dec-POMDP)
value and optimistic value. It also shows that SPs with
more agents are less affected: this makes sense since optimistic
assumptions account for a smaller fraction of the
achievable value. In other words, larger sub-problems
give a tighter approximation.

6.3 Bounding Heuristic Methods 

Here we investigate the ability to provide informative
global upper bounds. While the previous analysis shows
that the overestimation is quite significant at the true
edges of the problem (where no agents exist), this is not
necessarily informative of the overestimation at internal
edges in decompositions of larger problems (where other
agents do exist, even if not superheroes). As such, besides
investigating the upper bounding capability, the analysis
here also provides a better understanding of such


Figure 5: The global IO-Q-Dec-POMDP upper 
bound on large FFG instances. 


internal overestimations. 

We use the tightest upper bound we could find by
considering different SP partitions, with sizes ranging
from $|\mathcal{D}'| = 2$ to $5$, and investigate the guarantees that it
can provide for transfer planning (TP) [27], which is one
of the methods capable of providing solutions for large
factored Dec-POMDPs. Since TP is a heuristic
method that does not provide the exact value of the
reported joint policy, the value of TP, $V^{TP}$, is determined
using 10,000 simulations of the found joint policy,
leading to accurate estimates (see footnote 12). To put the results into
context, we also show the value of a random policy. Finally,
we show (second y-axis in Fig. 5) what we call the
empirical approximation factor (EAF; see footnote 13):

$$\mathrm{EAF} = \max\left\{ \frac{V^{IO}}{V^{TP}}, \frac{V^{TP}}{V^{IO}} \right\}.$$

This is a number comparable to the approximation factors
of approximation algorithms [39].


We computed upper bounds for large, horizon $h = 4$,
FFG instances. The computation of the local upper
bounds for the largest SPs used (i.e., $|\mathcal{D}'| = 5$) took
3.31s for IO-Q-MMDP and 2696.23s for IO-Q-Dec-POMDP.
Fig. 5 shows the results, which indicate that the
upper bound is relatively tight: the solutions found by
TP are not too far from the upper bound. In particular,
the EAF lies typically between 1.4 and 1.7, thus

12 Note that there is no method for Dec-POMDP policy
evaluation that runs in polynomial time. In fact, existence
of such a method would reduce the complexity of
solving a Dec-POMDP to NP, an impossibility since the
time hierarchy theorem implies that NP ≠ NEXP.
13 ‘Empirical’ emphasizes the lack of a priori guarantees.




































|D|      50        75        100       250
V^TP     -71.99    -111.07   -148.70   -382.47
V^IO     -72.00    -107.06   -144.00   -360.00
EAF      1.00      1.04      1.03      1.06

Table 1: Empirical approximation factors for
Aloha (h = 3) with varying number of agents.


providing firm guarantees for solutions of factored Dec-POMDPs
with up to 700 agents. Moreover, we see that
the EAF stays roughly constant for the larger problem
instances, indicating that relative guarantees do not
degrade as the number of agents increases. Of course,
the question of whether the optimal value lies closer to
the blue (UB) or orange (TP) line remains open; only
further research on improved (heuristic) solution methods
and tighter upper bounds can answer that question.
However, we have gone from a situation where the only
upper bound we had was ‘predict Rmax for every stage’
(which corresponds to the value 0 and EAF = ∞) to a situation
where we have a much more informative bound.

Results obtained for a similar approach for Aloha using
SPs containing up to $|\mathcal{D}'| = 6$ agents are shown in
Table 1. The numbers clearly illustrate that it is possible
to provide very strong guarantees for problems up to
$|\mathcal{D}| = 250$ agents (beyond which memory forms the bottleneck
for TP); the solution for the $|\mathcal{D}| = 50$ instance
is essentially optimal, indicating also a very tight bound
for this problem.
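
As a worked example, the EAF row of Table 1 can be recomputed from the reported values. The snippet is a small sketch; the function name `eaf` and the dictionary layout are chosen here for illustration only, while the numbers are those of Table 1.

```python
def eaf(v_io, v_tp):
    """Empirical approximation factor: the larger of the two value ratios."""
    return max(v_io / v_tp, v_tp / v_io)

# Number of agents -> (V^TP, V^IO), as reported in Table 1 (Aloha, h = 3).
aloha = {
    50: (-71.99, -72.00),
    75: (-111.07, -107.06),
    100: (-148.70, -144.00),
    250: (-382.47, -360.00),
}

factors = {n: round(eaf(v_io, v_tp), 2) for n, (v_tp, v_io) in aloha.items()}
print(factors)
```

Taking the maximum of both ratios makes the factor independent of the sign convention of the values, which matters here because all rewards are costs and hence negative.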



                h = 2           h = 3           h = 4           h = 5
                #      s        #      s        #      s        #      s

HouseSearch: Diamond, Pr(correct obs.) = 0.75
M               55     2.36     245    33.86    2072   1179     -      -
P               55     2.42     241    34.21    1744   1010     -      -
M’              45     2.00     91     9.00     183    64.32    97     991.2
P’              45     2.02     91     10.05    119    39.84    66     607.9

HouseSearch: Diamond, Pr(correct obs.) = 0.5
M               47     3.35     225    51.80    1434   1935     -      -
P               47     3.32     223    52.75    1416   1905     -      -
M’              23     1.96     24     4.69     41     38.42    148    1882
P’              11     0.89     15     3.90     19     25.32    23     256.8

HouseSearch: Squares, Pr(correct obs.) = 0.75
M               25     2.41     311    120.5    1323   5846     -      -
P               25     2.56     311    121.3    1032   7184     -      -
M’              25     2.61     277    109.6    816    3924     -      -
P’              25     2.53     240    95.28    752    6760     -      -

Table 2: Node count (#) and run time in seconds (s) for A*-OIS.


6.4 Improved Heuristic Influence Search

Aside from analyzing the solution quality of approximate
methods, our bounds can also be used in optimal
methods. In particular, A*-OIS [41] solves TD-POMDPs
(a sub-class of factored Dec-POMDPs) by decomposing
them into 1-agent SPs, searching through
the space of influences, and pruning using optimistic
heuristics. However, existing A*-OIS heuristics treat the
unspecified-influence stages of the SPs as fully observable.
In contrast, IO-Q-MPOMDP models the partial observability
of the SPs.

We now present results that suggest an added computational
benefit to treating partial observability in the
A*-OIS heuristics. Table 2 illustrates the differences
in pruning afforded by four different A*-OIS heuristics:
M is the baseline MDP-based heuristic from [41], P
is shorthand for IO-Q-MPOMDP, and M’ and P’ are
variations that improve tightness with locally-derived
probability bounds on the optimistic influences. We report
node counts and runtime across several problem
instances from the HouseSearch domain [41].

As shown, the POMDP-based heuristics tend to allow
for more pruning (fewer expanded nodes) and thereby
reduced computation in comparison to their MDP-based
counterparts. Contrasting this reduction across two variants
of Diamond HouseSearch suggests that the IO-Q-MPOMDP
heuristics gain more advantage when observability
is more restricted. However, this advantage is
sometimes outweighed by the increased computational
overhead of a more complex heuristic calculation (such
as in Squares HouseSearch h = 4). The question of predicting
in advance when this will be the case is interesting,
but difficult (and orthogonal to our contribution).

7. RELATED WORK

Here we provide an overview of related approaches.

Scalable Heuristic Methods.

In recent years, many scalable approaches without
guarantees have been developed for Dec-POMDPs and
related models [14, 46, 48, 16, 40, 31, 21, 47, 36]. The
upper bounding mechanism that we propose could be
useful for benchmarking many of these methods.


Sub-Problems vs. Source Problems.

The use of sub-problems (SPs) is conceptually similar
to the use of source problems in transfer planning
(TP) [27]. Differences are that, in our partitioning-based
upper bounding scheme, SPs are selected such
that they do not contain overlapping rewards, while TP
allows for overlapping source problems. Moreover, TP
does not consider optimistic influences but implicitly
assumes arbitrary influences. Finally, TP is used as a
way to compute a heuristic for the original problem; our
approach here simply returns a scalar value (although
extensions to use IO heuristics for heuristic search for
factored Dec-POMDPs are an interesting direction for
future research).

Optimism with Respect to Influences. 

Being optimistic with respect to external influences
has been considered before. For instance, Kumar &
Zilberstein [14] make optimistic assumptions on transitions
in an ND-POMDP to derive an MMDP-based
policy which is used to sample belief points for memory-bounded
dynamic programming. The approach does
not use these assumptions to upper bound the global
value and the formulation is specific to ND-POMDPs.
As described in the experiments, Witwicki, Oliehoek &
Kaelbling [41] use a local upper bound in order to perform
heuristic influence search for TD-POMDPs. IO-Q-MMDP
can be seen as a generalization of that heuristic
to both factored Dec-POMDPs and multiagent sub-problems,
while our other heuristics additionally deal
with partial observability.

Quality Guarantees for Large Dec-POMDPs. 

As mentioned in the introduction, a few approaches to
computing upper bounds for large-scale MASs and employing
them in heuristic search methods have been proposed.
In particular, there are some scalable approaches
with guarantees for the case of transition and observa¬ 
tion independence as encountered in TOI-MDPs and 
ND-POMDPs [31 |2| [381 mm [9]- The approach by 


Witwicki for TD-POMDPs [42, Sect. 6.6] is a bit more
general in that it allows some forms of transition dependence,
as long as the interactions are directed (one agent
can affect another) and no two agents can affect the same
factor in the same stage. In addition, scalability relies
on each agent having only a handful of ‘interaction ancestors’.

However, these previous approaches rely on the true
value function being factored as the sum of a set $\mathcal{E}$ of
local value components:

$$V(\pi) = \sum_{e\in\mathcal{E}} V_e(\pi_e),$$

where $\pi_e$ is the local joint policy of the agents that participate
in component $e$. (This is also referred to as the
‘value factorization’ framework [16].) For this setting,
an upper bound is easily constructed as the sum of local
upper bounds:

$$\hat{V}(\pi) = \sum_{e\in\mathcal{E}} \hat{V}_e(\pi_e).$$

While this resembles our IO upper bound (14), the crucial
distinction is that for value-factorized settings computing
$\hat{V}_e$ does not require any influence-optimism: the
reason that value factorization holds is precisely because
there are no influence sources for the components. As
such, our influence-optimistic upper bounds can be seen
as a strict generalization of the upper bounds that have
been employed for settings with factored value functions.
Investigating whether such methods, for instance the method by
Dibangoye et al. [9], can be modified to use our IO-UBs
is an interesting direction of future work.

Finally, the event-detecting multiagent MDP [15] provides
quality guarantees for a specific class of sensor network
problems by using the theory of submodular function
maximization. It is the only previous method with
quality guarantees that delivers scalability with respect
to the number of agents without assuming that the value
function of the problem is additively factored into small
components.


8. CONCLUSIONS 

We presented a family of influence-optimistic upper
bounds for the value of sub-problems of factored Dec-POMDPs,
together with a partition-based decomposition
approach that enables the computation of global
upper bounds for very large problems. The approach
builds upon the framework of influence-based abstraction [28],
but (in contrast to that work) makes optimistic
assumptions on the incoming ‘influences’, which
makes the sub-problems easier to solve. An empirical
evaluation compares the proposed upper bounds and
demonstrates that it is possible to achieve guarantees for
problems with hundreds of agents, showing that the
heuristic solutions found are in fact close to optimal (empirical
approximation factors of < 1.7 in all cases and sometimes
substantially better). This is a significant contribution,
given the complexity of computing ε-approximate
solutions and the fact that tight global upper bounds are
of crucial importance to interpret the quality of heuristic
solutions.

Intuitively, the proposed approach is expected to work
well in settings where the Dec-POMDP is ‘weakly’ coupled.
The work by Witwicki & Durfee [44] identifies
three dimensions that can be used to quantify the notion
of weak coupling. Our experiments suggest the existence
of a new dimension that can be thought of as
influence strength. This dimension captures the impact
of non-local behavior on local values and thus directly
relates to how well a problem can be approximated using
localized components.

In this paper we focused on the finite-horizon case,
but the principle of influence optimism underlying the
upper-bounding approach can also be applied in infinite-horizon
settings. Furthermore, the approach can be
modified to compute ‘pessimistic’ influence (i.e., lower)
bounds, which could be useful in competitive settings,
or for risk-sensitive planning [20]. It is also immediately
applicable to problems involving ‘unpredictable’
dynamics [45, 7]. Finally, our upper-bounding method
contributes a useful precursor for techniques that automatically
search the space of possible upper-bound decompositions,
efficient optimal influence-space heuristic
search methods (for which we provided preliminary evidence
in this paper), and A* methods for a large class
of factored Dec-POMDPs. In particular, a promising
idea is to employ our factored upper bounds in combination
with the heuristic search methods by Dibangoye
et al. [9]. While it is not possible to directly use that
method since it additionally requires a factored lower
bound function, pessimistic-influence bounds could provide
those.

A limitation of the current approach is that sub-problems
still need to be relatively small, since we rely on
optimal optimistic solution of the sub-problems. Developing
more scalable ‘optimistic solution methods’ thus
is an important direction of future work. Experiments
with influence search indicate that using probabilistic
bounds on the positive influences has a major impact [41].
As such, another important direction of future work
is investigating if it is possible to develop tighter upper
bounds by making more realistic optimistic assumptions.
APPENDIX


A. PROOFS 

Lemma 1. Let $\pi^{t:h-1}_c$ be a $(h-t)$-steps-to-go policy. Let $\nu \in \mathcal{V}$ and $\nu_{IO} \in \mathcal{V}_{IO}$ be the vectors induced by $\pi^{t:h-1}_c$
under regular MPOMDP back-projections (for some $I_{\to c}$), and under IO back-projections. Then

$$\forall x^t_c \quad \max_{D^{t+1}_c} \nu(\langle x^t_c, D^{t+1}_c\rangle) \le \nu_{IO}(x^t_c).$$


Proof. The proof is via induction. The base case is for the last stage $t = h - 1$, in which case the vectors only consist
of the immediate reward:

$$\max_{D^{t+1}_c} \nu(\langle x^t_c, D^{t+1}_c\rangle) = \max_{D^{t+1}_c} R_{a^t}(x^t_c) = R_{a^t}(x^t_c) = \nu_{IO}(x^t_c),$$

where $a^t$ is the joint action specified by $\pi^{t:h-1}_c$, thus proving the base case. The induction step follows.

Induction Hypothesis: Suppose that, given that $\nu \in \mathcal{V}$ and $\nu_{IO} \in \mathcal{V}_{IO}$ are vectors for the same policy $\pi^{t+1:h-1}_c$,
for all $x^{t+1}_c$

$$\max_{D^{t+2}_c} \nu(\langle x^{t+1}_c, D^{t+2}_c\rangle) \le \nu_{IO}(x^{t+1}_c) \qquad (15)$$

holds.

To prove: Given that $\nu \in \mathcal{V}$ and $\nu_{IO} \in \mathcal{V}_{IO}$ are vectors for the same policy $\pi^{t:h-1}_c$, for all $x^t_c$

$$\max_{D^{t+1}_c} \nu(\langle x^t_c, D^{t+1}_c\rangle) \le \nu_{IO}(x^t_c). \qquad (16)$$


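Proof:

We first define the vectors from the l.h.s. Let $a^t$ denote the first joint action specified by $\pi^{t:h-1}_c$. Then we can write

$$\nu(\langle x^t_c, D^{t+1}_c\rangle) = r_a(x^t_c) + \gamma \sum_{o^{t+1}} \nu^{a,o}_{\pi}(\langle x^t_c, D^{t+1}_c\rangle) \qquad (17)$$

where $\nu^{a,o}_{\pi}$ is the back-projection of the vector $\Gamma(\pi^{t:h-1}_c, a^t_c, o^{t+1}_c)$ that corresponds to $\pi^{t+1:h-1}_c = \pi^{t:h-1}_c{\Downarrow}_{a,o}$ (the
sub-tree of $\pi^{t:h-1}_c$ given $a^t, o^{t+1}$). That is, by filling out the definition of back-projection, we get

$$\nu(\langle x^t_c, D^{t+1}_c\rangle) = r_a(x^t_c) + \gamma \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, T_{I\to c}(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, a^t_c)\, [\Gamma(\pi^{t:h-1}_c, a^t_c, o^{t+1}_c)](\langle x^{t+1}_c, D^{t+2}_c\rangle)$$

where $D^{t+2}_c$ is specified as a function of $x^t_c, D^{t+1}_c, a^t, x^{t+1}_c$. Clearly, introducing a maximization cannot decrease the
value, so

$$\begin{aligned}
\nu(\langle x^t_c, D^{t+1}_c\rangle) &\le r_a(x^t_c) + \gamma \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, T_{I\to c}(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, a^t_c)\, \max_{D^{t+2}_c} [\Gamma(\pi^{t:h-1}_c, a^t, o^{t+1})](\langle x^{t+1}_c, D^{t+2}_c\rangle) \\
\{\text{I.H.}\}\quad &\le r_a(x^t_c) + \gamma \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, T_{I\to c}(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c)
\end{aligned}$$

where $\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})$ is the IO vector that corresponds to $\pi^{t+1:h-1}_c = \pi^{t:h-1}_c{\Downarrow}_{a,o}$.

Now we define the r.h.s. vector:

$$\begin{aligned}
\nu_{IO}(x^t_c) &= r_a(x^t_c) + \gamma \sum_{o^{t+1}} \nu^{a,o}_{IO}(x^t_c) \\
&= r_a(x^t_c) + \gamma \sum_{o^{t+1}} \max_{u^{t+1}_c} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t_c, x^{t+1}_c)\, \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c). \qquad (18)
\end{aligned}$$

We need to show that

$$\begin{aligned}
& r_a(x^t_c) + \gamma \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, T_{I\to c}(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, a^t)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c) \\
&\quad \le r_a(x^t_c) + \gamma \sum_{o^{t+1}} \max_{u^{t+1}_c} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t_c, x^{t+1}_c)\, \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c) \qquad (19)
\end{aligned}$$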






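which holds if and only if

$$\begin{aligned}
& \max_{D^{t+1}_c} \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, T_{I\to c}(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c) \\
&\quad \le \sum_{o^{t+1}} \max_{u^{t+1}_c} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t_c, x^{t+1}_c)\, \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c). \qquad (20)
\end{aligned}$$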
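To show this is the case, we start with the l.h.s.:

$$\begin{aligned}
& \max_{D^{t+1}_c} \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, T_{I\to c}(x^{t+1}_c \mid \langle x^t_c, D^{t+1}_c\rangle, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t_c, o^{t+1})](x^{t+1}_c) \\
&= \max_{D^{t+1}_c} \sum_{o^{t+1}} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c) \Big[ \sum_{u^{t+1}_c} \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, I(u^{t+1}_c \mid D^{t+1}_c) \Big] \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t_c, o^{t+1})](x^{t+1}_c) \\
&= \max_{D^{t+1}_c} \sum_{o^{t+1}} \sum_{u^{t+1}_c} I(u^{t+1}_c \mid D^{t+1}_c) \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t_c, o^{t+1})](x^{t+1}_c) \\
&\le \{\text{max. of a function is greater than its expectation}\} \\
&\quad\; \max_{D^{t+1}_c} \sum_{o^{t+1}} \max_{u^{t+1}_c} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t, o^{t+1})](x^{t+1}_c) \\
&= \{\text{no dependence on } D^{t+1}_c \text{ anymore}\} \\
&\quad\; \sum_{o^{t+1}} \max_{u^{t+1}_c} \sum_{x^{t+1}_c} O(o^{t+1} \mid a^t, x^{t+1}_c)\, \Pr(x^{t+1}_n \mid x^t_c, a^t_c, u^{t+1}_c)\, \Pr(x^{t+1}_l \mid x^t_c, a^t_c)\, [\Gamma_{IO}(\pi^{t:h-1}_c, a^t_c, o^{t+1})](x^{t+1}_c)
\end{aligned}$$

which is the r.h.s. of (20), the inequality that we needed to demonstrate, thereby finishing the proof.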